test(pt_expt): tolerance for test_default_fallback (CUDA scatter nondeterminism) #5508
…eterminism) test_default_fallback compared neighbor_list=None against an explicit DefaultNeighborList() with exact equality (assert_array_equal). The two are the identical builder, so on CPU the results are bit-identical, but the test runs two independent forward passes and on CUDA the dpa3 GNN message-passing scatter (atomic adds) is not bit-reproducible run-to-run -- the virial differed by ~1 ULP (abs 3.5e-18, rel 5.6e-16), failing the exact comparison intermittently. Switch to assert_allclose with a tight tolerance (rtol=1e-10, atol=1e-12), far above the fp noise floor but orders of magnitude tighter than any real dispatch divergence would be.
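The run-to-run difference described above comes from floating-point addition not being associative: when atomic adds land in a different order, the same addends can round to a different final bit. A minimal CPU-only sketch of that effect, using illustrative values rather than anything from the dpa3 scatter kernel:

```python
import math

# Three addends accumulated in two different orders, standing in for
# atomic adds that happen to interleave differently on two runs.
# The values are illustrative only, not taken from the model.
order_a = (0.1 + 0.2) + 0.3
order_b = 0.1 + (0.2 + 0.3)

# The two sums are mathematically equal but not bit-identical.
bits_differ = order_a != order_b

# Express the gap in units in the last place (ULP) of the result.
gap_in_ulp = abs(order_a - order_b) / math.ulp(order_b)

print(bits_differ, gap_in_ulp)
```

An exact comparison treats these two sums as different, even though the gap is at the level of a single rounding step.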
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #5508 +/- ##
==========================================
- Coverage 81.42% 81.42% -0.01%
==========================================
Files 871 871
Lines 96951 96951
Branches 4241 4243 +2
==========================================
- Hits 78941 78938 -3
- Misses 16708 16710 +2
- Partials 1302 1303 +1
5f848ad
Summary
test_default_fallback (added in #5491) compared neighbor_list=None against an explicit DefaultNeighborList() with exact equality (assert_array_equal). The two are the identical builder (call_common does builder = nl if nl is not None else DefaultNeighborList()), so on CPU the two forward passes are bit-identical. But the test runs two independent forward passes, and on CUDA the dpa3 GNN message-passing scatter (atomic adds) is not bit-reproducible run-to-run. The virial differed by ~1 ULP (abs 3.46e-18, rel 5.64e-16), failing the exact comparison intermittently.
Fix
Switch the comparison to assert_allclose(rtol=1e-10, atol=1e-12), which is far above the fp noise floor (~1e-16 rel) but orders of magnitude tighter than any real dispatch divergence (e.g. accidentally using a different builder) would produce. Verified on CPU that None and DefaultNeighborList() are bit-identical (max|Δ| = 0), confirming the residual is CUDA atomic nondeterminism, not a dispatch bug.
Known limitations
test_default_fallback pass). The failing CUDA path could not be re-validated locally (no GPU available this session); the CI CUDA job is the confirmation. The tolerance is ~6 orders of magnitude above the observed noise, so it is robust. test_pt_expt_equivalence tests already use a 1e-9 tolerance.
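The margin argument can be checked on CPU with a small sketch. The arrays below are hypothetical stand-ins for the virial output (not values from the test), the helper name passes is made up for illustration, and the 1e-6 relative error is an assumed example of a genuine divergence:

```python
import numpy as np

# Hypothetical reference values; not taken from the actual test output.
ref = np.array([6.2e-3, -1.5e-2, 4.0e-1])

# Each element moved by exactly 1 ULP, mimicking the observed CUDA noise.
noisy = np.nextafter(ref, np.inf)

# An assumed "real" divergence: a 1e-6 relative error in every element.
diverged = ref * (1.0 + 1e-6)


def passes(check, actual, desired, **kwargs):
    """Return True if the numpy.testing check does not raise."""
    try:
        check(actual, desired, **kwargs)
        return True
    except AssertionError:
        return False


tol = dict(rtol=1e-10, atol=1e-12)

exact_accepts_noise = passes(np.testing.assert_array_equal, noisy, ref)
tolerant_accepts_noise = passes(np.testing.assert_allclose, noisy, ref, **tol)
tolerant_accepts_divergence = passes(np.testing.assert_allclose, diverged, ref, **tol)

print(exact_accepts_noise, tolerant_accepts_noise, tolerant_accepts_divergence)
```

Under these assumptions the exact check rejects the 1-ULP perturbation, while the tolerant check accepts it and still rejects the larger divergence.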